Skip to content

Add grapheme_levenshtein function. #18087

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 9 commits into
base: master
Choose a base branch
from

Conversation

youkidearitai
Copy link
Contributor

Measure levenshtein for grapheme cluster unit.
RFC: https://wiki.php.net/rfc/grapheme_levenshtein

@TimWolla TimWolla changed the title [Require RFC][Draft] Add grapheme_levenshtein function. Add grapheme_levenshtein function. Mar 16, 2025
@youkidearitai youkidearitai marked this pull request as ready for review April 16, 2025 00:41
@youkidearitai
Copy link
Contributor Author

This RFC was accepted. I will go to implementation.

@devnexen
Copy link
Member

Nice, could you possibly rebase with last master state to redo the CI ? cheers.

youkidearitai and others added 2 commits April 16, 2025 15:06
@youkidearitai
Copy link
Contributor Author

I rebased and force pushed. I'm waiting for CI end.

@devnexen
Copy link
Member

should be fine, thanks. Looking good but I ll go back at it in couple of hours.

Copy link
Member

@devnexen devnexen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM minus the remark

if (ustring1) {
efree(ustring1);
}
RETURN_FALSE;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason that false is being returned on failure, rather than throwing an exception?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For example, Userland can uses intl_get_error_message function after this function if this error. Therefore, this block is returns false.

@youkidearitai
Copy link
Contributor Author

Hmm... Windows CI is failed.

/* Set global error code. */
intl_error_set_code(NULL, ustatus1);

/* Set error messages. */
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These kind of comments are pointless, this just is the english version of the code, please avoid these kinds of comments.

/* Set error messages. */
intl_error_set_custom_msg(NULL, "Error converting input string to UTF-16", 0);
if (ustring1) {
efree(ustring1);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You don't need to NULL check this.

/* Set error messages. */
intl_error_set_custom_msg(NULL, "Error converting input string to UTF-16", 0);
if (ustring2) {
efree(ustring2);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You don't need to NULL check this (and same below).

/* Set global error code. */
intl_error_set_code(NULL, ustatus2);

/* Set error messages. */
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same pointless comment.

intl_convert_utf8_to_utf16(&ustring2, &ustring2_len, pstr2, ZSTR_LEN(string2), &ustatus2);

if (U_FAILURE(ustatus2)) {
/* Set global error code. */
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same pointless comment.

intl_convert_utf8_to_utf16(&ustring1, &ustring1_len, pstr1, ZSTR_LEN(string1), &ustatus1);

if (U_FAILURE(ustatus1)) {
/* Set global error code. */
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same pointless comment.


unsigned char u_break_iterator_buffer1[U_BRK_SAFECLONE_BUFFERSIZE];
unsigned char u_break_iterator_buffer2[U_BRK_SAFECLONE_BUFFERSIZE];
bi1 = grapheme_get_break_iterator((void*)u_break_iterator_buffer1, &ustatus1);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where are the error checks for these calls and the next ones?

int32_t pos1 = 0;
int32_t pos2 = 0;
int32_t usrch_pos = 0;
while (pos1 != UBRK_DONE) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like you should change this to while (true) because otherwise it's confusing which check actually matters

RETURN_THROWS();
}

zend_long *p1, *p2, *tmp;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where possible, I think it's better to merge the declaration and assignment as it reduces the area/scope where something is valid/can be used.

if (pos2 == UBRK_DONE) {
break;
}
usrch_pos = grapheme_strpos_utf16(pstr1 + current1, pos1 - current1, pstr2 + current2, pos2 - current2, 0, NULL, 0, 0);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function call will make the conversion from UTF-8 to UTF-16 again and it seems very inefficient to do it this way. This is also the equivalent of strrpos which doesn't seem to make sense here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, indeed.
However, usearch.h requires UTF-16 because UChar is UTF-16.
Maybe, possible to use u_uastrcpy to usearch_open.
Please give me a time. I'll try it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unified internal encoding is UTF-16.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants